Abstract


The ability to recognize the shape and movement of hands can improve the
user experience across a wide range of technical domains and platforms. It
can support sign language understanding and hand gesture control, for
example, and it can enable digital information and content to be overlaid
on the real world in augmented reality. This paper presents a real-time,
on-device hand gesture recognition solution that lets users control a
system's graphical user interface (GUI) with static and dynamic hand
gestures, which can be trained to trigger actions similar to those performed
with a mouse and keyboard. It is built with MediaPipe-Hands, which detects
the palm and the hand landmark points. The data is then sent through a
pipeline of data-preprocessing functions and used to train two models: one
for static gesture recognition and one for dynamic gesture recognition. In
real time, the models are then used to detect similar gestures on-device
from a video-capturing device such as a webcam.


1. Introduction

To enhance the user experience in various technical fields and platforms,
recognizing hand shape and motion is crucial. It can be used for augmented
reality, sign language, and hand gesture control. We propose a real-time
hand gesture recognition system using the MediaPipe-Hands API and a deep
learning model, which eliminates the need for specialized equipment. This
system works on personal computers and mobile devices, requiring only a
webcam or mobile camera. Our approach addresses the limitations of earlier
methods, which relied on specialized hardware or were too computationally
intensive for real-time execution on mobile devices.

Our hand gesture recognition system detects the hand region from the input
video stream using the MediaPipe API and tracks the hand from frame to
frame to reduce performance overhead. For accuracy, a hand landmark model
is provided with a properly cropped image of the hand so that it can focus
on precise coordinate prediction. To reduce the need for data augmentation,
the crops are created from the previously identified hands. If the landmark
model is unable to recognize the hand, palm detection is used to locate it
again. We implemented the hand landmark tracking as a MediaPipe graph, with
a specialized Hand-Renderer subgraph handling the rendering. The palm
detection module employs both a hand landmark subgraph and a palm detection
subgraph.

The output from the MediaPipe API, which is in the form of landmark points,
is analyzed and recorded in a CSV file to train the model on a specific
gesture. Once trained with several gestures, the output model can detect
the trained gestures from any simple video feed in real time. The remaining
sections of this paper are structured as follows. Section 2 presents
related studies, followed by Section 3, which provides an overview of the
methodology. Sections 4 and 5 present the detailed framework and the
experimental results, respectively. Lastly, Section 6 presents the
conclusion, summarizing the entire research.


2. Related Studies 
2.1. Hand Shape based methodology 


One approach to recognizing dynamic sign language 
involves analyzing the properties of hand forms 
and motion trajectory, which is a common practice. 
Some researchers, like Kim et al. (Kim et al.), have 
focused on hand shape characteristics to recognize 
fingerspelling. However, this method is limited to 
simple motions, such as alphabets and numerals, 
and cannot identify more complex gestures without 
considering hand motion. To overcome this limita- 
tion, local features from depth and intensity images 
are learned using the unsupervised deep learning 
method PCANet. A linear support vector machine 
classifier is then used to recognize the extracted fea- 
tures (S. Aly et al.). The pattern recognition sys- 
tem transforms an image into a feature vector and 
compares the feature vectors of a training set of ges- 
tures (Maung). 


Other researchers, such as Haroon et al. (Haroon 
et al.), have proposed using artificial neural net- 
works to recognize gestures with symmetric pat- 
terns under varying illumination conditions. Mean- 
while, Mohandes et al. (Mohandes, Deriche, J. Liu, 
et al.) used long-short-term memory to detect hand 
motions based solely on hand motion trajectory. 
Additional research has classified hand movements 
using sensor technology such as the Leap Motion
controller (Sonawane et al.), digital gloves, surface
electromyography, accelerometers, and gyroscopes
(K. Li, Z. Zhou, Lee, et al.) through (Ma et al.).
However, these methods are limited to specific hand 
movements like waving and gesticulating. Kumar et 
al. (Kumar et al.) presented a multimodal framework 
for detecting sign language using Kinect sensors (K. 
Lai, Konrad, Ishwar, et al.), while Wang et al. (H. 
Wang, Chai, and Chen) used sparse observation to 
recognize sign language based on hand postures and 
motions with RGB-D data. Both have successfully
conducted research in this field. However, sensor- 
based systems are not practical or user-friendly. As 
the number of sign language vocabularies increases, 
sensor-based systems become more complicated. 
Recognizing the entire sign language vocabulary 
requires analyzing hand form details and motion tra- 
jectory, which can be challenging for users since 
it often requires sensors. Additionally, understand- 
ing dynamic sign language is difficult due to the 
complex and changeable motion trajectory driven by 
hand-body joint interactions. 


2.2. Continuous Frame based methodology 


To recognize sign language, researchers have 
explored various approaches that do not require sen- 
sors, such as using advanced algorithms to analyze 
video sequence characteristics. For example, Cui et 
al. (Cui, H. Liu, C. Zhang, et al.) developed a recur- 
rent convolutional neural network for continuous 
sign language recognition using video sequences, 
and Huang et al. (Huang et al.) proposed a hierar- 
chical attention network framework for global-local 
video feature representations. 


Some researchers have also used deep learn- 
ing to achieve dynamic sign language recognition, 
including convolutional neural networks (CNNs) to 
extract features from hand gestures (Krizhevsky,
Sutskever, Hinton, et al.; Donahue et al.), recurrent
neural networks (RNNs) to learn video sequences
(Maraqa, Abu-Zaiter, et al.; Murakami, Taguchi, et al.;
Srivastava, Mansimov, Salakhutdinov, et al.), and
combined CNNs and RNNs to learn spatiotemporal
sequence features (Baccouche et al.; Ng et al.). Other
techniques include fuzzy classification (J. Li and
D. Zhang), self-supervised contrastive
pre-training (X. Zhang et al.), neural network clas- 
sification (Kang, Tripathi, Nguyen, et al.), and 
sequence-to-sequence learning (Liao et al.). Com- 
pared to methods relying on hand forms and motion 
trajectories, these video-based approaches have 
shown higher performance in detecting dynamic 
sign language. However, most of these methods 
were not specifically designed for sign language 
recognition and may struggle to identify complex 
sign language with varied hand forms and motion 
trajectories. 


Although some researchers have combined motion information into static
photographs (Köpüklü, Köse, Rigoll, et al.), this technique has limited
utility since not all hand motion information can be captured in static
images. Another study (Z. Liu et al.) proposed a spotting detection-based
paradigm for large-scale continuous gesture recognition.


3. Methodology 
3.1. Dataset 


The dataset used in this study consists of hand ges- 
tures captured using MediaPipe Hands, a technol- 
ogy that detects and tracks key points, or land- 
marks, of a hand in an image or video. The dataset 
comprises 21 landmark points that are identified 
for each hand gesture, including gestures like fist, 
wave, and thumbs up. Each gesture has over 1000 
recorded instances, each with their respective land- 
mark points, resulting in a significant amount of data 
that can be utilized for training and evaluating hand 
gesture recognition models. Moreover, new datasets 
can be created very quickly on the fly, and the model 
can be retrained within seconds, making it a highly 
flexible and efficient system. 
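
As a minimal sketch of how one training sample could be recorded, the snippet below writes a class label followed by the 42 flattened coordinate values of one hand as a single CSV row. The function name `append_sample` and the row layout (label first, then x, y pairs) are assumptions made for illustration; the paper does not specify the file format beyond stating that landmark points are stored in a CSV file.

```python
import csv
import io


def append_sample(writer, label, landmarks):
    """Write one row: the class label followed by 42 flattened x, y values."""
    if len(landmarks) != 21:
        raise ValueError("expected 21 landmark points per hand")
    row = [label]
    for x, y in landmarks:
        row.extend([x, y])
    writer.writerow(row)


# An in-memory buffer stands in for the dataset file in this sketch.
buffer = io.StringIO()
writer = csv.writer(buffer)

# Hypothetical landmark values; real values would come from MediaPipe Hands.
sample = [(i * 0.01, i * 0.02) for i in range(21)]
append_sample(writer, 0, sample)

first_row = buffer.getvalue().strip().split(",")
```

Because each sample is a single short row, adding a new gesture only requires appending rows with a new label, which is consistent with the quick retraining described above.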


3.2. Experimental Procedure 


Our hand tracking solution employs a machine 
learning pipeline that consists of multiple mod- 
els working in tandem. The first model, 
BlazePalm (Amangeldy et al.), is a palm detector 
that operates on the entire image and produces a 
bounding box around the hand. This results in a 
more precise image of the hand, thereby reducing 
the need for data augmentation techniques, such as 
rotation, translation, and scaling. This allows the 
network to focus on improving its precision. 

 0. WRIST                  11. MIDDLE_FINGER_DIP
 1. THUMB_CMC              12. MIDDLE_FINGER_TIP
 2. THUMB_MCP              13. RING_FINGER_MCP
 3. THUMB_IP               14. RING_FINGER_PIP
 4. THUMB_TIP              15. RING_FINGER_DIP
 5. INDEX_FINGER_MCP       16. RING_FINGER_TIP
 6. INDEX_FINGER_PIP       17. PINKY_MCP
 7. INDEX_FINGER_DIP       18. PINKY_PIP
 8. INDEX_FINGER_TIP       19. PINKY_DIP
 9. MIDDLE_FINGER_MCP      20. PINKY_TIP
10. MIDDLE_FINGER_PIP


FIGURE 1. Landmark Points


The second model is a hand landmark model that 
works on the cropped image region defined by the 
palm detector and produces high-fidelity 3D hand 
landmark keypoints. 

When the palm prediction suggests that the hand 
is lost, a single-shot detector model designed for 
mobile, real-time applications similar to BlazeFace
is used to identify initial hand positions. This model 
can identify occluded and self-occluded hands while 
working with a wide range of hand sizes, which 
is challenging because hands do not exhibit high- 
contrast patterns. 
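
The interplay between the two models can be sketched as a simple loop: full-frame palm detection runs only when no crop is available from the previous frame, and the landmark model's hand flag decides whether the crop is reused. This is an illustrative outline, not the MediaPipe implementation; `detect_palm` and `predict_landmarks` are hypothetical callables standing in for the two models, and the stub versions below exist only so the sketch can run.

```python
def track_hand(frames, detect_palm, predict_landmarks):
    """Run the detector only when no crop is carried over from the last frame."""
    crop = None
    outputs = []
    for frame in frames:
        if crop is None:
            crop = detect_palm(frame)  # full-frame detection; may return None
        if crop is None:
            outputs.append(None)
            continue
        landmarks, hand_present, next_crop = predict_landmarks(frame, crop)
        if hand_present:
            outputs.append(landmarks)
            crop = next_crop  # reuse the landmark-derived crop on the next frame
        else:
            outputs.append(None)
            crop = None  # hand lost: fall back to palm detection
    return outputs


# Stub models for illustration: a frame is True when a hand is visible.
detector_calls = []


def fake_detect_palm(frame):
    detector_calls.append(frame)
    return "crop" if frame else None


def fake_predict_landmarks(frame, crop):
    return ["21 points"], bool(frame), crop


outputs = track_hand([True, True, False, True],
                     fake_detect_palm, fake_predict_landmarks)
```

In this run the detector is called twice for four frames: once at the start and once after the hand is lost in the third frame.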

The hand landmark model uses regression to 
locate 21 2.5D locations inside the identified hand 
areas. Even when the hand is partially visible or 
occluded, the model learns a consistent internal 
hand posture representation and is resilient. It has 
three outputs: the x and y coordinates of the land- 
mark points, a hand flag that indicates whether or not 
there is a hand in the input image, and a binary
classification of handedness, such as “left” or “right.”

Finally, two types of gesture recognition models are
used. The first is a keypoint-based model for static
gestures, where the relative landmark points remain
the same throughout the gesture. The second is a
point-history-based model for dynamic gestures, which
consist of a series of changing relative landmark
points.


4. Model Architecture 
4.1. Keypoint-based model architecture 


The described model is a feedforward neural net- 
work (FFNN) that utilizes fully connected layers. 
The model comprises a sequence of layers, where 
the output of each layer serves as the input for the 
subsequent layer. This neural network is also known 
as a dense neural network or a multilayer percep- 
tron (MLP). The input to the model is a 1D array of 
length 42, and the output is a probability distribution 
over the output classes. 


(Diagram labels: Input Layer, Hidden Layers, Output Layer)


FIGURE 2. A Multilayer Perceptron


The architecture of the model is sequential and
takes preprocessed data as input. The first layer is an
Input layer that takes input of shape (21 * 2,), that
is, 42 values representing 21 landmark points with
(x, y) coordinates. The second layer is a Dropout
layer with a rate of 0.2, which randomly sets 20% of
the input units to 0 at each training step to reduce
overfitting.
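
The effect of a Dropout layer can be illustrated with a short framework-free sketch. It uses the common "inverted dropout" formulation, in which the surviving units are scaled by 1 / (1 - rate) during training so that the expected activation is unchanged, and the layer does nothing at inference time. The function `dropout` below is written for illustration and is not taken from the paper's code.

```python
import random


def dropout(values, rate, training=True, rng=random):
    """Zero each value with probability `rate`; scale survivors by 1/(1-rate)."""
    if not training or rate == 0.0:
        return list(values)  # dropout is inactive at inference time
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


# With rate 0.2, roughly 20% of the units are zeroed in each training step.
rng = random.Random(0)
out = dropout([1.0] * 10000, 0.2, rng=rng)
zeroed_fraction = out.count(0.0) / len(out)
```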

The third layer is a Dense layer with 20 units
and a rectified linear unit (ReLU) activation function
applied to its output. The fourth layer is another
Dropout layer with a rate of 0.4, which randomly
sets 40% of its input units to 0 at each training
step.

The fifth layer is another Dense layer with 10
units and a ReLU activation function. Finally, the
output layer is a Dense layer with N units, where N
is the number of gesture classes, and a softmax
activation function.
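
To give a sense of how small this network is, the layer sizes above can be turned into a trainable-parameter count. Each Dense layer has (inputs x units) weights plus one bias per unit, and Dropout layers have no parameters. The value N = 8 used below is an assumption taken from the eight classes listed in the key-point classification report; the helper names are illustrative.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def keypoint_model_params(num_classes):
    """Per-layer parameter counts for the 42 -> 20 -> 10 -> N network."""
    sizes = [42, 20, 10, num_classes]
    return [dense_params(a, b) for a, b in zip(sizes, sizes[1:])]


per_layer = keypoint_model_params(num_classes=8)
total = sum(per_layer)
# per_layer is [860, 210, 88], so the whole model has 1158 parameters.
```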

The softmax activation function in the output
layer is well suited to multi-class classification
problems. It transforms the output of the previous
layer, a vector of arbitrary real values, into a
probability distribution over the N classes: every
output lies between 0 and 1, and the outputs sum
to 1. The class with the highest probability is taken
as the prediction for a given input.
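
The softmax step can be written in a few lines. The sketch below subtracts the largest input before exponentiating, a standard numerical-stability measure that does not change the result; the function names are illustrative and the input values are made up.

```python
import math


def softmax(logits):
    """Map arbitrary real values to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return the index of the most probable class and the probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


predicted_class, probabilities = predict([1.0, 3.0, 0.5])
```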


(Diagram: an example 42-value input vector passes through the Input layer,
Dropout layer (0.2), Dense ReLU layer, Dropout layer (0.4), Dense ReLU
layer, and Output softmax layer.)


FIGURE 3. Keypoint based model architecture


In this model, the MediaPipe Hands API is used to
obtain landmark coordinates of a hand from a video
sequence. Each frame of the video yields a set of
21 landmark coordinates, which correspond to key
points on the hand, such as fingertips, knuckles, and
the wrist. The raw landmark coordinates are then
processed using a 4-stage preprocessing pipeline to
transform them into a suitable format for the model
to learn from.

In the first stage of preprocessing, the landmark 
coordinates of each frame are recorded and stored 
in a 2D array of shape (21, 2). Each row of the array 
represents the x and y coordinates of a single land- 
mark point. This step helps to capture the spatial 
information of the hand gestures and represents the 
hand pose in a structured format that can be used as 
input to the model. 

In the second stage of preprocessing, the coordinates
are converted into relative coordinates with respect
to a reference point, the wrist landmark with index 0.
Specifically, the x and y coordinates of the wrist are
subtracted from the x and y coordinates of each
landmark point, so the wrist itself maps to (0, 0). In
Figure 4, for example, landmark 1 at [485, 428]
becomes [-66, -37] relative to the wrist at [551, 465].
This step removes the effect of hand translation
within the frame and makes the model more invariant
to where the hand appears in the video.
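
A minimal sketch of this stage, using the first four landmark values shown in Figure 4, is given below. The function name `to_relative` is an illustrative choice rather than a name from the paper.

```python
def to_relative(landmarks, base_index=0):
    """Express every landmark relative to the reference landmark (the wrist)."""
    base_x, base_y = landmarks[base_index]
    return [(x - base_x, y - base_y) for x, y in landmarks]


# Pixel coordinates of landmarks 0 to 3 as listed in Figure 4.
absolute = [(551, 465), (485, 428), (439, 362), (408, 307)]
relative = to_relative(absolute)
```

Shifting the whole hand by a constant offset leaves `relative` unchanged, which is the translation invariance this stage is meant to provide.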

In the third stage of pre-processing, the relative 
coordinates are flattened into a 1D array of length 
21 * 2, where each element in the array represents 
a single coordinate value. This step helps to create 
a fixed-length input representation for the model to 
learn from. By flattening the 2D array, the model 
can treat each coordinate value as a separate input 
feature, which makes it easier to learn complex rela- 
tionships between the input and output. 

In the final stage of preprocessing, the values in
the flattened array are divided by the maximum
absolute value in the array, so that every value lies
between -1 and 1. This reduces the effect of hand
size and distance from the camera on the input, and
keeping the inputs in a small, consistent range
generally makes training more stable.
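
The flattening and normalization stages can be sketched together as follows. The input uses three of the relative coordinates from Figure 4, including the point with the largest magnitude (-277); the function name and the handling of an all-zero input are assumptions made for this illustration.

```python
def flatten_and_normalize(relative_points):
    """Flatten (x, y) pairs to 1D and divide by the largest absolute value."""
    flat = [value for point in relative_points for value in point]
    max_abs = max(abs(v) for v in flat)
    if max_abs == 0:
        return [0.0 for _ in flat]  # degenerate case: all points coincide
    return [v / max_abs for v in flat]


# Relative coordinates of landmarks 0, 1, and 20 from Figure 4.
features = flatten_and_normalize([(0, 0), (-66, -37), (151, -277)])
```

The rounded results match the last row of Figure 4: -66 / 277 is about -0.24, -37 / 277 is about -0.13, and the largest-magnitude value maps to -1.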

Overall, the preprocessing steps transform the raw
landmark coordinates into a fixed-length
representation, normalized for translation and scale,
that captures the key features of the hand pose. The
same approach can be extended to other applications
such as sign language recognition and human-robot
interaction.


Stage                          | ID:0       | ID:1         | ID:2         | ID:3         | ... | ID:17        | ID:18        | ID:19        | ID:20
1. Landmark coordinates        | [551, 465] | [485, 428]   | [439, 362]   | [408, 307]   | ... | [633, 315]   | [668, 261]   | [687, 225]   | [702, 188]
2. Relative to ID:0            | [0, 0]     | [-66, -37]   | [-112, -103] | [-143, -158] | ... | [82, -150]   | [117, -204]  | [136, -240]  | [151, -277]
3. Flattened to a 1D array     | 0, 0       | -66, -37     | -112, -103   | -143, -158   | ... | 82, -150     | 117, -204    | 136, -240    | 151, -277
4. Normalized to max abs value | 0, 0       | -0.24, -0.13 | -0.4, -0.37  | -0.52, -0.57 | ... | 0.296, -0.54 | 0.422, -0.74 | 0.491, -0.87 | 0.545, -1


FIGURE 4. Point-History based model architecture


The model described below is built as a sequential
stack of layers around a long short-term memory
(LSTM) layer, an architecture commonly used for
processing sequential data such as time series,
natural language text, and speech. Each layer has a
specific function in processing the input data.

FIGURE 5. Point history based model architecture

The Input layer is the first layer of the model,
which expects an input of shape (TIME_STEPS *
DIMENSION,). Here, TIME_STEPS represents the
number of time steps over which data is collected
for each instance, while DIMENSION represents the
number of features per time step. In this model,
DIMENSION is set to 2, which corresponds to the x
and y coordinates of each landmark point in the
input data.

The Reshape layer converts each 1D input array into
shape (TIME_STEPS, DIMENSION). Together with the
batch axis, this gives the 3D tensor of shape (batch,
TIME_STEPS, DIMENSION) that the LSTM layer
requires. The first Dropout layer is added to reduce
overfitting, a common problem in neural networks
that occurs when the model fits the training data
too closely and fails to generalize to new, unseen
data. This Dropout layer randomly sets 20% of the
input units to 0 at each training step.
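
What the Reshape layer does to a single sample can be shown with plain lists. The sketch assumes TIME_STEPS = 16, matching the 16 keyframes used for each gesture, and DIMENSION = 2; the function name is illustrative.

```python
def reshape_history(flat, time_steps=16, dimension=2):
    """Split a flat list of length time_steps * dimension into per-step rows."""
    if len(flat) != time_steps * dimension:
        raise ValueError("input length must equal time_steps * dimension")
    return [flat[t * dimension:(t + 1) * dimension] for t in range(time_steps)]


# A dummy flat sample of 32 values becomes 16 rows of [x, y].
sequence = reshape_history(list(range(32)))
```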

The LSTM layer is a type of recurrent neural 
network that is well-suited for processing sequen- 
tial data. It has 16 units and an input shape of 
[TIME_STEPS, DIMENSION]. This layer is partic- 
ularly useful for modeling sequences of data, as it is 
capable of capturing long-term dependencies in the 
input sequence. 
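
For readers unfamiliar with the LSTM cell, the following framework-free sketch implements one time step of the standard formulation (input, forget, and output gates plus a candidate state) and the usual parameter-count formula. It illustrates the general technique and is not the paper's trained layer; the all-zero weights are placeholders chosen so the result is easy to check by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate_preactivation(weights, x, h_prev):
    """Compute W @ x + U @ h_prev + b for one gate using plain lists."""
    W, U, b = weights
    return [
        sum(w * xi for w, xi in zip(W[j], x))
        + sum(u * hi for u, hi in zip(U[j], h_prev))
        + b[j]
        for j in range(len(b))
    ]


def lstm_step(params, x, h_prev, c_prev):
    """Advance the cell one time step; params maps gate name to (W, U, b)."""
    i = [sigmoid(z) for z in gate_preactivation(params["i"], x, h_prev)]    # input gate
    f = [sigmoid(z) for z in gate_preactivation(params["f"], x, h_prev)]    # forget gate
    o = [sigmoid(z) for z in gate_preactivation(params["o"], x, h_prev)]    # output gate
    g = [math.tanh(z) for z in gate_preactivation(params["g"], x, h_prev)]  # candidate
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


def lstm_param_count(units, input_dim):
    """Four gates, each with input weights, recurrent weights, and biases."""
    return 4 * (units * input_dim + units * units + units)


UNITS, DIMENSION = 16, 2


def zero_gate():
    return ([[0.0] * DIMENSION for _ in range(UNITS)],
            [[0.0] * UNITS for _ in range(UNITS)],
            [0.0] * UNITS)


params = {name: zero_gate() for name in ("i", "f", "o", "g")}
h, c = lstm_step(params, [0.1, -0.2], [0.0] * UNITS, [1.0] * UNITS)
```

With zero weights every gate outputs 0.5 and the candidate is 0, so the cell state is simply halved. Under the standard formula, a layer with 16 units and 2 input features has 1216 parameters.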


FIGURE 6. General scheme of an LSTM cell


The second Dropout layer is added after the 
LSTM layer to further prevent overfitting. This 
Dropout layer randomly sets 50% of the input units 
to 0 during each training batch. 

The Dense layer is added to the model after the
Dropout layer. It has 10 units and a rectified linear
unit (ReLU) activation function applied to its
output. The ReLU activation function is commonly
used in neural networks and introduces non-linearity
into the model.

Finally, the output layer is a Dense layer with 
NUM_CLASSES units and a softmax activation 
function. This layer is responsible for produc- 
ing the output of the model, which in this case is 
a classification of the input data into one of the 
NUM_CLASSES output classes. 

The data preprocessing in this model is a crucial
step that transforms the raw landmark coordinates
into a suitable input format for the LSTM layer to
learn from. In the first stage, the landmark
coordinates are recorded as a time series (Figure 7).
Since the data is a time series, it is essential to
select the frames that capture the motion of the hand
over time. Therefore, the second stage involves
selecting the last 16 frames for each gesture, which
are referred to as the keyframes. These frames serve
as the input for the model and capture the temporal
evolution of the hand gesture.

Time series coordinates, as (x, y) pairs per time step:
0.0000, 0.0000 | -0.0250, 0.0204 | -0.0427, 0.0426 | ... | 0.0979, 0.1000 | 0.0979, 0.0574 | 0.0958, 0.0241

FIGURE 7. Stages of point history preprocessing

In the third stage, the relative coordinates of each
landmark point are calculated with respect to the
position of the first landmark point. This step
removes the effect of hand translation in the video,
which can affect the accuracy of the model. The
fourth stage involves normalizing the relative
coordinates so that their magnitudes fall within
[0, 1]; the sign is preserved, as the negative sample
values in Figure 7 show. Normalizing the data reduces
the effect of scaling in the input data, which can
improve the performance of the model. The model
takes this normalized, flattened list of coordinates
as input. The length of the array is determined by
the number of keyframes (16), the number of landmark
points per frame (21), and the number of coordinates
per landmark point (2).

Thus, the preprocessing steps turn the raw landmark
coordinates into a fixed-length sequence that
captures the temporal evolution of the hand gesture.
This sequence is used to train the LSTM layer, which
is well suited to processing sequential data and
capturing long-term dependencies in the input
sequence.
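
The point-history stages can be sketched as below under two explicit assumptions that the text does not settle: a single tracked point per frame (which matches DIMENSION = 2 in the model description, whereas the text above counts 21 points per frame), and normalization by dividing the offsets by the frame width and height. The function name and the sample values are illustrative.

```python
def preprocess_point_history(history, frame_width, frame_height, time_steps=16):
    """Keep the last `time_steps` points, make them relative, then scale."""
    recent = history[-time_steps:]
    if len(recent) < time_steps:
        raise ValueError("not enough frames recorded yet")
    base_x, base_y = recent[0]  # first keyframe is the reference position
    flat = []
    for x, y in recent:
        # Assumed normalization: offsets as a fraction of the frame size.
        flat.append((x - base_x) / frame_width)
        flat.append((y - base_y) / frame_height)
    return flat


# Hypothetical pixel positions of one fingertip over 20 frames.
history = [(i * 10, i * 5) for i in range(20)]
features = preprocess_point_history(history, frame_width=1000, frame_height=500)
```

The output always starts with 0, 0 for the reference frame, as in the first pair shown in Figure 7, and has 16 * 2 = 32 values for a single tracked point.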


5. Results 


5.1. Dynamic Gesture Recognition 


Examples of dynamic gesture recognition are shown in Figure 8.

(Panels: Gesture: Clockwise, FPS 16.6; Gesture: Counter-Clockwise,
FPS 15.7; Gesture: Move, FPS 14.92)


FIGURE 8. Examples of Dynamic Gesture Recognition


5.2. Static Gesture Recognition


Examples of static gesture recognition are shown in Figure 9.

(Panels: Left: Drag; Right: Close)


FIGURE 9. Examples of Static Gesture Recognition


Rishav Nath Pati et al.


5.3. Confusion Matrix for Key-point based Model


Figure 10 shows the classification report for the key-point based model on
a test set of 74 samples. The report shows the precision, recall, and
F1-score for each class, as well as the accuracy and the macro and weighted
averages across all classes. Precision is the proportion of true positive
predictions (correctly classified instances) out of all positive
predictions (total instances predicted as positive). Recall is the
proportion of true positive predictions out of all actual positive
instances in the test set. The F1-score is the harmonic mean of precision
and recall, and provides a balanced measure of the model's performance. In
this report, the precision, recall, and F1-score for all eight classes are
1.00, meaning that the model correctly classified every instance in the
test set. Accordingly, the accuracy and the macro and weighted average
F1-scores are also 1.00.
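
The metric definitions above can be made concrete with a small framework-free sketch that computes per-class precision, recall, and F1-score, along with the macro and weighted F1 averages. The labels in the example are made up for illustration and are not the paper's test data.

```python
def per_class_report(y_true, y_pred):
    """Compute precision, recall, F1, and support for every class label."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report


def macro_and_weighted_f1(report):
    """Macro: plain mean over classes. Weighted: mean weighted by support."""
    total = sum(r["support"] for r in report.values())
    macro = sum(r["f1"] for r in report.values()) / len(report)
    weighted = sum(r["f1"] * r["support"] for r in report.values()) / total
    return macro, weighted


# Made-up labels: class 2 has a single sample and is never predicted.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
report = per_class_report(y_true, y_pred)
macro_f1, weighted_f1 = macro_and_weighted_f1(report)
```

In this example the rare, never-predicted class pulls the macro average well below the weighted average, because the macro average gives every class equal weight regardless of its support.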


5.4. Confusion Matrix for Point history based 
Model 

Figure 11 shows the classification report for the point-history based
model, with the precision, recall, and F1-score for a multi-class
classification problem. The model was evaluated on a dataset of 1341
samples and achieved an accuracy of 0.95. The weighted averages for
precision, recall, and F1-score are 0.94, 0.95, and 0.94, while the macro
averages are lower, at 0.76, 0.77, and 0.76, because one class with only
15 samples scores 0.00 on all three metrics.


Classification Report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         8
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        13
           3       1.00      1.00      1.00         9
           4       1.00      1.00      1.00         9
           5       1.00      1.00      1.00         7
           6       1.00      1.00      1.00         6
           7       1.00      1.00      1.00        13

    accuracy                           1.00        74
   macro avg       1.00      1.00      1.00        74
weighted avg       1.00      1.00      1.00        74


FIGURE 10. Confusion Matrix for Key-point model


Classification Report
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       385
           1       0.94      0.98      0.96       301
           2       0.93      0.98      0.95       301
           3       0.97      0.87      0.91       339
           4       0.00      0.00      0.00        15

    accuracy                           0.95      1341
   macro avg       0.76      0.77      0.76      1341
weighted avg       0.94      0.95      0.94      1341


FIGURE 11. Confusion Matrix for Point-History model


6. Conclusion


In conclusion, this paper presents a real-time on-device hand gesture
recognition solution that uses the MediaPipe-Hands API and deep learning
models, implemented as TensorFlow Lite models, for static and dynamic
gesture recognition. This approach runs in real time on both PC and mobile
devices without the need for specialized hardware, and the TensorFlow Lite
models make it easy to deploy on mobile hardware. The system works by
detecting the hand region from the input video stream, tracking the hand
from frame to frame, and using a hand landmark model to increase accuracy.
The palm detection module is also used to help locate the hand when the
landmark model is unable to recognize it. The output from the MediaPipe API
is used to train the model for different gestures, and once trained, the
model can detect these gestures from any simple video feed. This technology
has potential applications in a wide range of fields, including sign
language interpretation, hand gesture control for virtual and augmented
reality, gaming, and assistive technology for people with mobility
impairments.

Future scope of work includes:

a. Incorporating more complex gestures: The current model is trained on a
   limited set of hand gestures. Future work can involve expanding the
   gesture vocabulary and training the model on a more diverse set of
   gestures.
b. Improving speed and accuracy: While the proposed solution is real-time,
   there is still room for improvement in terms of speed and accuracy. This
   can be achieved through optimizing the model architecture and
   hyperparameters, as well as using more powerful hardware.
c. Multi-user support: Currently, the proposed solution is designed to
   recognize gestures from a single user. Future work can involve extending
   the system to support multiple users simultaneously.
d. Integrating with other computer vision tasks: Hand gesture recognition
   can be combined with other computer vision tasks such as object
   detection and facial recognition to create more sophisticated
   applications. This can be achieved by integrating multiple models and
   developing more complex algorithms.